Using Correlation Dimension for Analysing Text Data
Abstract
In this article, we study the scale-dependent dimensionality and overall structure of text data using a method that measures the correlation dimension at different scales. As experimental results, we present an analysis of text data sets drawn from the Reuters and Europarl corpora and compare them with artificially generated point sets and with speech data. The results reflect some typical properties of the data, and we discuss how the method could be used to improve various data analysis applications.
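As an illustration of what measuring correlation dimension at different scales can look like, the following is a minimal sketch of the standard Grassberger-Procaccia correlation sum C(r), the fraction of point pairs closer than r, together with the local slope of log C(r) against log r. It is not the authors' method or code: the function names `correlation_sums` and `local_dimensions`, the choice of radii, and the synthetic point set (a line segment embedded in three dimensions) are all assumptions made for the example.

```python
import math
import random


def correlation_sums(points, radii):
    """Return C(r) for each radius: the fraction of distinct point pairs
    whose Euclidean distance is smaller than r."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    total = len(dists)
    return [sum(d < r for d in dists) / total for r in radii]


def local_dimensions(radii, sums):
    """Return the slope of log C(r) versus log r between neighbouring radii.
    Each slope is a scale-dependent estimate of the correlation dimension."""
    slopes = []
    for k in range(len(radii) - 1):
        if sums[k] == 0 or sums[k + 1] == 0:
            slopes.append(float("nan"))  # no pairs at this scale
            continue
        rise = math.log(sums[k + 1]) - math.log(sums[k])
        run = math.log(radii[k + 1]) - math.log(radii[k])
        slopes.append(rise / run)
    return slopes


# Hypothetical test data: 400 points on a straight segment in 3-D space.
rng = random.Random(0)
line = [(t, 2 * t, -t) for t in (rng.random() for _ in range(400))]

radii = [0.05 * 2 ** k for k in range(4)]  # 0.05, 0.1, 0.2, 0.4
sums = correlation_sums(line, radii)
dims = local_dimensions(radii, sums)

for r, d in zip(radii, dims):
    print(f"scale {r:.2f}: local dimension estimate {d:.2f}")
```

For a one-dimensional set such as this segment, the local estimates should stay close to 1 even though the points live in three coordinates. For text data represented as high-dimensional vectors, the same slope may vary noticeably from one scale to another, which is the kind of behaviour a scale-dependent analysis is meant to expose.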
Related resources
Semi-parametric Quantile Regression for Analysing Continuous Longitudinal Responses
Recently, quantile regression (QR) models have often been applied to longitudinal data analysis. When the distribution of the responses appears skewed and asymmetric because of outliers and heavy tails, QR models may be suitable. In this paper, a semi-parametric quantile regression model is developed for analysing continuous longitudinal responses. The error term's distribution is assumed to be Asymmetr...
Genre classification for a corpus of academic webpages
In this paper we report our analysis of the similarities between webpages crawled from European academic websites, and a comparison of their distribution in terms of the English language variety (native English vs English as a lingua franca) and their language family (based on the country’s official language). After building a corpus of university webpages, we selected a set of relevant ...
A non subjective approach to the GP algorithm for analysing noisy time series
We present an adaptation of the standard Grassberger-Procaccia (GP) algorithm for estimating the correlation dimension of a time series in a non-subjective manner. The validity and accuracy of this approach are tested using different types of time series, such as those from standard chaotic systems, pure white and colored noise, and chaotic systems with added noise. The effectiveness of the sche...
Bayesian paradigm for analysing count data in longitudinal studies using Poisson-generalized log-gamma model
In analyzing longitudinal data with count responses, a normal distribution is usually assumed for the random effects. However, in some applications the random effects may not be normally distributed. Misspecification of this distribution may reduce the efficiency of estimators. In this paper, a generalized log-gamma distribution is used for the random effects which includes th...
Using Complex Argumentative Interactions to Reconstruct the Argumentative Structure of Large-Scale Debates
In this paper we consider the insights that can be gained by considering large scale argument networks and the complex interactions between their constituent propositions. We investigate metrics for analysing properties of these networks, illustrating these using a corpus of arguments taken from the 2016 US Presidential Debates. We present techniques for determining these features directly from...